A vocoder (/ˈvoʊkoʊdər/, short for voice encoder) is an analysis/synthesis system, mostly used for speech. In the encoder, the input is passed through a multiband filter, each band is passed through an envelope follower, and the control signals from the envelope followers are communicated to the decoder. The decoder applies these (amplitude) control signals to corresponding filters in the (re)synthesizer.
It was originally developed in the 1930s as a speech coder for telecommunications applications, the idea being to code speech for transmission. Its primary use in this fashion is for secure radio communication, where voice has to be encrypted and then transmitted. The advantage of this method of "encryption" is that the original signal itself is never sent, only the envelopes of the bandpass filter outputs. The receiving unit needs to be set up in the same channel configuration to resynthesize a version of the original signal spectrum. The vocoder, as both hardware and software, has also been used extensively as an electronic musical instrument.
Whereas the vocoder analyzes speech, transforms it into electronically transmitted information, and recreates it, the Voder (from Voice Operating Demonstrator) generates synthesized speech by means of a console with fifteen touch-sensitive keys and a pedal. It basically consists of the "second half" of the vocoder, but with manual filter controls, and it needs a highly trained operator.[1][2]
The human voice consists of sounds generated by the opening and closing of the glottis by the vocal cords, which produces a periodic waveform with many harmonics. This basic sound is then filtered by the nose and throat (a complicated resonant piping system) to produce differences in harmonic content (formants) in a controlled way, creating the wide variety of sounds used in speech. There is another set of sounds, known as the unvoiced and plosive sounds, which are created or modified by the mouth in different fashions.
The vocoder examines speech by measuring how its spectral characteristics change over time. This results in a series of numbers representing the spectral content at any particular time as the user speaks. In simple terms, the signal is split into a number of frequency bands (the larger this number, the more accurate the analysis), and the level of signal present in each frequency band gives the instantaneous representation of the spectral energy content. Thus, the vocoder dramatically reduces the amount of information needed to store speech, from a complete recording to a series of numbers. To recreate speech, the vocoder simply reverses the process, passing a broadband noise source through a stage that filters the frequency content based on the originally recorded series of numbers. Information about the instantaneous frequency (as distinct from spectral characteristic) of the original voice signal is discarded; it was not important to preserve this for the purposes of the vocoder's original use as an encryption aid, and it is this "dehumanizing" quality of the vocoding process that has made it useful in creating special voice effects in popular music and audio entertainment.
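The analysis step described above can be illustrated with a minimal Python sketch. It is not a reconstruction of any historical design: a naive discrete Fourier transform stands in for the analog band-pass filter bank, and the function names, frame length and band edges are assumptions chosen for illustration. Each frame of samples is reduced to one level per band, which is the "series of numbers" a channel vocoder would transmit.

```python
import cmath
import math


def band_levels(frame, sample_rate, band_edges_hz):
    """Reduce one frame of samples to one level per frequency band.

    A naive DFT stands in for the band-pass filter bank (an assumption
    made for brevity; hardware vocoders use real filters).
    """
    n = len(frame)
    power = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                  for i, x in enumerate(frame))
        power.append(abs(acc) ** 2)
    levels = []
    for lo, hi in zip(band_edges_hz[:-1], band_edges_hz[1:]):
        # Collect the power of every DFT bin whose frequency lies in [lo, hi).
        in_band = [p for k, p in enumerate(power)
                   if lo <= k * sample_rate / n < hi]
        levels.append(math.sqrt(sum(in_band)) / n)
    return levels


def analyze(signal, sample_rate, frame_len, band_edges_hz):
    """Split a signal into consecutive frames and measure each one."""
    return [band_levels(signal[s:s + frame_len], sample_rate, band_edges_hz)
            for s in range(0, len(signal) - frame_len + 1, frame_len)]


if __name__ == "__main__":
    rate = 8000
    edges = [0, 500, 1500, 2500, 4000]  # hypothetical four-band layout
    tone = [math.sin(2 * math.pi * 1000 * i / rate) for i in range(320)]
    for levels in analyze(tone, rate, 160, edges):
        print([round(v, 3) for v in levels])
```

For a pure 1 kHz tone, nearly all of the measured level falls in the 500 to 1500 Hz band. Only the per-band levels survive the analysis, so the exact frequency inside a band is lost, which is the loss of instantaneous-frequency information mentioned above.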
Most analog vocoder systems use a number of frequency channels, all tuned to different frequencies (using band-pass filters). The various values of these filters are stored not as the raw numbers, which are all based on the original fundamental frequency, but as a series of modifications to that fundamental needed to modify it into the signal seen in the output of that filter. During playback these settings are sent back into the filters and then added together, modified with the knowledge that speech typically varies between these frequencies in a fairly linear way. The result is recognizable speech, although somewhat "mechanical" sounding. Vocoders also often include a second system for generating unvoiced sounds, using a noise generator instead of the fundamental frequency.
The first experiments with a vocoder were conducted in 1928 by Bell Labs engineer Homer Dudley, who was granted a patent for it on March 21, 1939.[4] The Vocoder was introduced to the public at the AT&T building at the 1939-1940 New York World's Fair.[2] Dudley's vocoder was used in the SIGSALY system, which was built by Bell Labs engineers (Alan Turing was briefly involved) in 1943. The SIGSALY system was used for encrypted high-level communications during World War II. Later work in this field has been conducted by James Flanagan.
RALCWI technology uses unique proprietary signal decomposition and parameter encoding methods, ensuring high voice quality at high compression ratios. The voice quality of RALCWI-class vocoders, as estimated by independent listeners, is similar to that provided by standard vocoders running at bit rates above 4000 bit/s. The Mean Opinion Score (MOS) of voice quality for this vocoder is about 3.5-3.6. This value was determined by a paired comparison method, in listening tests that compared the developed vocoder with standard voice vocoders.
The RALCWI vocoder operates on a “frame-by-frame” basis. Each 20 ms source voice frame consists of 160 samples of linear 16-bit PCM sampled at 8 kHz. The voice encoder performs voice analysis at high time resolution (8 times per frame) and forms a set of estimated parameters for each voice segment. All of the estimated parameters are quantized to produce 41-, 48- or 55-bit frames, using Vector Quantization (VQ) of different types. All of the vector quantizers were trained on a mixed multi-language voice base, which contains voice samples in both Eastern and Western languages.
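The frame figures quoted above imply specific bit rates. The short Python calculation below derives them from the stated 20 ms frame, 8 kHz sampling and 41-, 48- or 55-bit coded frames. The variable and function names are illustrative only, and the resulting rates are simple arithmetic on the quoted figures rather than vendor specifications.

```python
SAMPLE_RATE_HZ = 8000
FRAME_MS = 20
BITS_PER_SAMPLE = 16
ANALYSES_PER_FRAME = 8

samples_per_frame = SAMPLE_RATE_HZ * FRAME_MS // 1000           # 160 samples
frames_per_second = 1000 // FRAME_MS                            # 50 frames/s
samples_per_analysis = samples_per_frame // ANALYSES_PER_FRAME  # 20 samples (2.5 ms)
pcm_bits_per_frame = samples_per_frame * BITS_PER_SAMPLE        # 2560 bits


def coded_bit_rate(bits_per_frame):
    """Bit rate in bit/s for a given coded frame size."""
    return bits_per_frame * frames_per_second


for bits in (41, 48, 55):
    print(bits, "bits/frame ->", coded_bit_rate(bits), "bit/s")
# 41 bits/frame -> 2050 bit/s
# 48 bits/frame -> 2400 bit/s
# 55 bits/frame -> 2750 bit/s
```

Compared with the 2560 bits of raw PCM in each frame, a 48-bit coded frame is smaller by a factor of roughly 53.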
The waveform-interpolative (WI) vocoder was developed at AT&T Bell Laboratories around 1995 by W.B. Kleijn, and a low-complexity version was subsequently developed by AT&T for the DoD secure vocoder competition. Notable enhancements to the WI coder were made at the University of California, Santa Barbara. AT&T holds the core patents related to WI, and other institutes hold additional patents. Using these patents as part of a WI coder implementation requires licensing from all IPR holders.
The product is the result of a co-operation between CML Microcircuits and SPIRIT DSP. The co-operation combines CML’s 39-year history of developing mixed-signal semiconductors for professional and leisure communication applications, with SPIRIT’s experience in embedded voice products.
SPIRIT, a Russian company founded in 1992, implements mostly standard audio and data communication software products, primarily outside the US.
Founded in 1968, CML Microcircuits is involved in the design, development and supply of low-power analogue, digital and mixed-signal semiconductors for telecommunications systems worldwide. CML’s ICs, the CMX608, CMX618 and CMX638, are based upon SPIRIT’s proprietary Low Bit-Rate Vocoder technology and are marketed to communication markets worldwide.
The Voder (Voice Operating Demonstrator), an earlier speech synthesizer demonstrated in 1939,[2][media 1] was an initial research machine to test compression schemes for human voice along copper wires and voice encryption using radio.[citation needed] It consisted of a series of oscillators using radio valves to produce tones, and gas discharge tubes to produce noise (hiss). The output was modified using a series of filters, which were controlled by a set of keys and a foot pedal to convert the hisses and tones into vowels, consonants, and inflections. It was a complex machine to operate, and producing sounds similar to human speech required considerable skill. In 1948 Werner Meyer-Eppler recognized the capability of the Voder machine to generate electronic music.
Since the late 1970s, most non-musical vocoders have been implemented using linear prediction, whereby the target signal's spectral envelope (formant) is estimated by an all-pole IIR filter. In linear prediction coding, the all-pole filter replaces the bandpass filter bank of its predecessor and is used at the encoder to whiten the signal (i.e., flatten the spectrum) and again at the decoder to re-apply the spectral shape of the target speech signal.
One advantage of this type of filtering is that the location of the linear predictor's spectral peaks is entirely determined by the target signal, and can be as precise as allowed by the time period to be filtered. This is in contrast with vocoders realized using fixed-width filter banks, where spectral peaks can generally only be determined to be within the scope of a given frequency band. LP filtering also has disadvantages in that signals with a large number of constituent frequencies may exceed the number of frequencies that can be represented by the linear prediction filter. This restriction is the primary reason that LP coding is almost always used in tandem with other methods in high-compression voice coders.
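The linear prediction scheme described above can be sketched in a few dozen lines of Python. This is a minimal illustration, assuming the autocorrelation method with the Levinson-Durbin recursion, which is one common way to fit the all-pole model; the text does not say which method any particular coder uses. All function names are hypothetical, and the test signal is synthetic noise shaped by a known two-coefficient filter, not real speech.

```python
import random


def autocorrelation(x, max_lag):
    """Autocorrelation sums r[0..max_lag] of one analysis frame."""
    n = len(x)
    return [sum(x[i] * x[i + lag] for i in range(n - lag))
            for lag in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Fit predictor coefficients a[1..order] so x[n] ~ sum(a[k] * x[n-k])."""
    a = [0.0] * (order + 1)  # a[0] is unused
    err = r[0]
    for i in range(1, order + 1):
        k = (r[i] - sum(a[j] * r[i - j] for j in range(1, i))) / err
        prev = a[:]
        a[i] = k
        for j in range(1, i):
            a[j] = prev[j] - k * prev[i - j]
        err *= 1.0 - k * k
    return a[1:], err


def whiten(x, coeffs):
    """Encoder side: subtract the prediction, leaving a flatter residual."""
    out = []
    for n in range(len(x)):
        pred = sum(c * x[n - k]
                   for k, c in enumerate(coeffs, start=1) if n - k >= 0)
        out.append(x[n] - pred)
    return out


def reshape(excitation, coeffs):
    """Decoder side: pass an excitation through the all-pole filter."""
    y = []
    for n in range(len(excitation)):
        pred = sum(c * y[n - k]
                   for k, c in enumerate(coeffs, start=1) if n - k >= 0)
        y.append(excitation[n] + pred)
    return y


# Synthetic demonstration: shape white noise with a known filter, then
# recover the filter from the shaped signal alone.
rng = random.Random(0)
true_coeffs = [1.3, -0.6]  # arbitrary stable example values
noise = [rng.gauss(0.0, 1.0) for _ in range(4000)]
signal = reshape(noise, true_coeffs)
coeffs, _ = levinson_durbin(autocorrelation(signal, 2), 2)
residual = whiten(signal, coeffs)
rebuilt = reshape(residual, coeffs)
```

In this sketch `coeffs` should come out close to `true_coeffs`, and feeding the residual back through `reshape` reproduces the signal. A low-rate coder would not send the residual itself but would replace it with a compact excitation, such as a pulse train or noise, which is where the compression comes from.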
Even with the need to record several frequencies, and the additional unvoiced sounds, the compression of the vocoder system is impressive. Standard speech-recording systems capture frequencies from about 500 Hz to 3400 Hz, where most of the frequencies used in speech lie, typically using a sampling rate of 8 kHz (slightly greater than the Nyquist rate). The sampling resolution is typically at least 12 bits per sample (16 is standard), for a final data rate in the range of 96-128 kbit/s. However, a good vocoder can provide a reasonably good simulation of voice with as little as 2.4 kbit/s of data.
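The data rates quoted above follow directly from the sampling parameters. The short Python calculation below reproduces them and the resulting compression factor relative to a 2.4 kbit/s vocoder; the names are illustrative only.

```python
SAMPLE_RATE_HZ = 8000
HIGHEST_SPEECH_HZ = 3400
VOCODER_BIT_RATE = 2400

# Sampling must exceed twice the highest frequency to be captured.
nyquist_rate_hz = 2 * HIGHEST_SPEECH_HZ  # 6800 Hz, below the 8000 Hz used


def pcm_bit_rate(bits_per_sample, sample_rate_hz=SAMPLE_RATE_HZ):
    """Uncompressed PCM bit rate in bit/s."""
    return bits_per_sample * sample_rate_hz


for bits in (12, 16):
    rate = pcm_bit_rate(bits)
    print(f"{bits}-bit PCM: {rate // 1000} kbit/s, "
          f"{rate / VOCODER_BIT_RATE:.1f}x the vocoder rate")
# 12-bit PCM: 96 kbit/s, 40.0x the vocoder rate
# 16-bit PCM: 128 kbit/s, 53.3x the vocoder rate
```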
'Toll Quality' voice coders, such as ITU G.729, are used in many telephone networks. G.729 in particular has a final data rate of 8 kbit/s with superb voice quality. G.723 achieves slightly worse quality at data rates of 5.3 kbit/s and 6.4 kbit/s. Many voice systems use even lower data rates, but below 5 kbit/s voice quality begins to drop rapidly.
Several vocoder systems are used in NSA encryption systems:
(ADPCM is not a proper vocoder but rather a waveform codec. ITU has gathered G.721 along with some other ADPCM codecs into G.726.)
Vocoders are also currently used in research in psychophysics, linguistics, computational neuroscience and cochlear implants.
Modern vocoders that are used in communication equipment and in voice storage devices today are based on the following algorithms:
For musical applications, a source of musical sounds is used as the carrier, instead of extracting the fundamental frequency. For instance, one could use the sound of a synthesizer as the input to the filter bank, a technique that became popular in the 1970s.
One of the first attempts to adapt the vocoder for making music may have been the “Siemens Synthesizer” at the Siemens Studio for Electronic Music, developed between 1956 and 1959.[10][media 2]
In 1968, Robert Moog developed one of the first solid-state musical vocoders for the electronic music studio of the University at Buffalo.[11][M 1]
In 1969, Bruce Haack built a prototype vocoder, named "Farad" after Michael Faraday,[12] and it was featured on his rock album The Electric Lucifer released in the same year.[13][media 3]
In 1970 Wendy Carlos and Robert Moog built another musical vocoder, a 10-band device inspired by the vocoder designs of Homer Dudley. It was originally called a spectrum encoder-decoder, and later referred to simply as a vocoder. The carrier signal came from a Moog modular synthesizer, and the modulator from a microphone input. The output of the 10-band vocoder was fairly intelligible, but relied on specially articulated speech. Later improved vocoders use a high-pass filter to let some sibilance through from the microphone; this ruins the device for its original speech-coding application, but it makes the "talking synthesizer" effect much more intelligible.
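The carrier and modulator arrangement described above can be sketched in Python as follows. This is a minimal illustration and not a model of the Carlos and Moog hardware: the second-order band-pass sections use the widely published "Audio EQ Cookbook" coefficient formulas, and the band spacing, Q value, smoothing time and function names are all assumptions. Each band's envelope is measured on the modulator and used to scale the same band of the carrier, and the scaled bands are summed.

```python
import math


def bandpass(x, center_hz, sample_rate, q):
    """Second-order band-pass filter with unity gain at the centre frequency."""
    w0 = 2.0 * math.pi * center_hz / sample_rate
    alpha = math.sin(w0) / (2.0 * q)
    a0 = 1.0 + alpha
    b0, b2 = alpha / a0, -alpha / a0  # b1 is zero for this filter type
    a1, a2 = -2.0 * math.cos(w0) / a0, (1.0 - alpha) / a0
    y = []
    x1 = x2 = y1 = y2 = 0.0
    for s in x:
        out = b0 * s + b2 * x2 - a1 * y1 - a2 * y2
        x2, x1 = x1, s
        y2, y1 = y1, out
        y.append(out)
    return y


def envelope(x, sample_rate, smoothing_s=0.01):
    """Envelope follower: rectify, then smooth with a one-pole low-pass."""
    coeff = 1.0 - math.exp(-1.0 / (smoothing_s * sample_rate))
    level, env = 0.0, []
    for s in x:
        level += coeff * (abs(s) - level)
        env.append(level)
    return env


def band_centers(low_hz, high_hz, count):
    """Geometrically spaced centre frequencies (an illustrative choice)."""
    ratio = (high_hz / low_hz) ** (1.0 / (count - 1))
    return [low_hz * ratio ** i for i in range(count)]


def vocode(modulator, carrier, sample_rate, centers, q=4.0):
    """Impose the modulator's per-band envelopes on the carrier."""
    out = [0.0] * min(len(modulator), len(carrier))
    for f in centers:
        env = envelope(bandpass(modulator, f, sample_rate, q), sample_rate)
        band = bandpass(carrier, f, sample_rate, q)
        for i in range(len(out)):
            out[i] += env[i] * band[i]
    return out


if __name__ == "__main__":
    rate, n = 8000, 4000
    # Stand-ins: a 1 kHz tone that starts halfway through plays the role of
    # the microphone, and a 110 Hz sawtooth plays the role of the synthesizer.
    voice = [0.0] * (n // 2) + [math.sin(2 * math.pi * 1000 * i / rate)
                                for i in range(n // 2)]
    synth = [2.0 * ((110.0 * i / rate) % 1.0) - 1.0 for i in range(n)]
    result = vocode(voice, synth, rate, band_centers(200.0, 3200.0, 10))
    print(max(abs(s) for s in result[: n // 2]),
          max(abs(s) for s in result[n // 2:]))
```

While the modulator is silent the output is silent too, even though the carrier plays throughout; once the modulator sounds, only the carrier harmonics near the modulator's active bands are let through.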
Carlos and Moog's vocoder was featured in several recordings, including the soundtrack to Stanley Kubrick's A Clockwork Orange in which the vocoder sang the vocal part of Beethoven's "Ninth Symphony". Also featured in the soundtrack was a piece called "Timesteps," which featured the vocoder in two sections. "Timesteps" was originally intended as merely an introduction to vocoders for the "timid listener", but Kubrick chose to include the piece on the soundtrack, much to the surprise of Wendy Carlos.
In 1972, Isao Tomita's first electronic music album Electric Samurai: Switched on Rock was an early attempt at applying speech synthesis technique through a vocoder[citation needed] in electronic rock and pop music. The album featured electronic renditions of contemporary rock and pop songs, while utilizing synthesized voices in place of human voices. In 1974, he utilized synthesized voices again in his popular classical music album Snowflakes are Dancing, which became a worldwide success and helped popularize electronic music.[14]
Kraftwerk's Autobahn (1974) was one of the first successful pop/rock albums to feature vocoder vocals. Another of the early songs to feature a vocoder was "The Raven" on the 1976 album Tales of Mystery and Imagination by progressive rock band The Alan Parsons Project; the vocoder also was used on later albums such as I Robot. Following Alan Parsons' example, vocoders began to appear in pop music in the late 1970s, for example, on disco recordings. Jeff Lynne of Electric Light Orchestra used the vocoder in several albums such as Time (featuring the Roland VP-330 Plus MkI). ELO songs such as "Mr. Blue Sky" and "Sweet Talkin' Woman" both from Out of the Blue (1977) use the vocoder extensively. Featured on the album are the EMS Vocoder 2000W MkI, and the EMS Vocoder (-System) 2000 (W or B, MkI or II).
Giorgio Moroder made extensive use of the vocoder on the 1975 album Einzelganger and on the 1977 album From Here to Eternity. Another example is Pink Floyd's album Animals, where the band put the sound of a barking dog through the device. Vocoders are often used to create the sound of a robot talking, as in the Styx song "Mr. Roboto". It was also used for the introduction to the Main Street Electrical Parade at Disneyland.
Vocoders have appeared on pop recordings from time to time ever since, most often simply as a special effect rather than a featured aspect of the work. However, many experimental electronic artists of the New Age music genre use the vocoder in a more comprehensive manner in specific works, such as Jean Michel Jarre (on Zoolook, 1984) and Mike Oldfield (on QE2, 1980 and Five Miles Out, 1982). There are also some artists who have made vocoders an essential part of their music, overall or during an extended phase. Examples include the German synthpop group Kraftwerk, Stevie Wonder ("Send One Your Love", "A Seed's a Star") and jazz/fusion keyboardist Herbie Hancock during his late 1970s period.
In 1982, Neil Young used a Sennheiser Vocoder VSM201 on six of the nine tracks on Trans.[15]
"Robot voices" became a recurring element in popular music during the 20th century. Several methods of producing variations on this effect exist: the Sonovox, talk box, Auto-Tune,[media 4] linear prediction vocoders, speech synthesis,[media 5][media 6] ring modulation and comb filtering.
In 1939, Alvino Rey used a carbon throat microphone wired in such a way as to modulate his electric steel guitar sound.[media 7] The mic, originally developed for military pilot communications, was placed on the throat of Rey's wife Luise King (one of The King Sisters), who stood behind a curtain and mouthed the words, along with the guitar lines. The novel-sounding combination was called "Singing Guitar", but was not developed further. Rey also created a somewhat similar "talking" effect, by manipulating the tone controls of his Fender electric guitar, but the vocal effect was less pronounced.[16]
Another early voice effect using the same principle of the throat as a filter was the Sonovox. Instead of a throat microphone modulating a guitar signal, it used small loudspeakers attached to the performer's throat.[1] It was used in films such as A Letter to Three Wives (1949), The Secret Life of Walter Mitty (1947), the voice of Casey Junior the train in Dumbo (1941) and The Reluctant Dragon (1941), the instruments in Rusty in Orchestraville, the piano in Sparky's Magic Piano, and the airplane in Whizzer The Talking Airplane (1947). The Sonovox was also used in many radio station IDs produced by PAMS of Dallas and JAM Creative Productions. Lucille Ball made one of her earliest film appearances during the 1930s in a Pathé Newsreel demonstrating the Sonovox.
The Sonovox makes an even earlier appearance in the 1940 film You'll Find Out starring Kay Kyser and his orchestra,[media 8] Bela Lugosi, Boris Karloff, and Peter Lorre. Lugosi uses the Sonovox to portray the voice of a dead person during a seance.
One of the earliest uses of a talk box appears in The Ventures' Christmas Album, released in 1965. In the song "Silver Bells", Red Rhodes spoke through a talk box, distorting the phrase silver bells.[17][media 9]
Vocoders are used in television production, filmmaking and games, usually for robots or talking computers.